In the last part of the module, we estimated the correlation between GPAs and SAT scores so that we would have a tool for guidance counselors to predict SAT scores given a student’s GPA. Or maybe we were hired by real estate agents to create a tool that predicts a home’s sale price given some features of the property. Or maybe we want to estimate the impact a government program like SNAP has on employment outcomes for citizens. How can we use correlations to inform us?
Conditional Mean
Let’s suppose today is your first day as a guidance counselor and your first student comes into your office and says, “what do you think I’ll get on my SAT?” Unfortunately, you do not know anything about them, so the best guess you can come up with is… the average.
However, if the student told you that they have a 2.8 GPA, you would likely adjust your guess downwards.
By adjusting your guess downwards, you are implicitly calculating a conditional mean. When we know nothing, the best guess we can come up with is the mean of the entire distribution. We talked about this in Module 2. However, if you have some additional information (e.g., GPA) before making your guess, your guess can only improve. We can demonstrate this with a simple simulation.
Conditional Mean
Let’s revisit the data that contained all-star basketball player heights. If you told me that you were going to select a player at random and ask me to guess their height, the “best” number to pick would be the mean of the data. Like we did with standard deviation, let’s define how wrong we are by the squared difference of our “mistake”, or error. Let’s simulate this:
Code
```r
library("scales")

bball <- read.csv("https://raw.githubusercontent.com/alexCardazzi/alexcardazzi.github.io/main/econ311/data/bball_allstars.csv")

# Create a function to randomly sample n times
# and calculate the squared difference from the guess.
# The function then returns the average of all n draws.
sim <- function(my_pick, n){
  sqr_diff <- rep(NA, n)
  for(i in 1:n){
    random_player <- sample(1:nrow(bball), 1)
    sqr_diff[i] <- (my_pick - bball$HEIGHT[random_player])^2
  }
  mean(sqr_diff)
}

# I am going to come up with MANY guesses.
# I am going to start 6 inches below the mean
# and increase by a tenth of an inch until
# I reach 6 inches above the mean.
the_picks <- seq(mean(bball$HEIGHT) - 6, mean(bball$HEIGHT) + 6, by = 0.1)
avg_sqr_diffs <- NULL
for(pick in the_picks){
  avg_sqr_diffs[length(avg_sqr_diffs) + 1] <- sim(pick, 100)
}

plot(the_picks, avg_sqr_diffs, las = 1,
     xlab = "Guess for Height", pch = 19,
     ylab = "Average Squared Error",
     col = alpha("dodgerblue", 0.33))
abline(v = mean(bball$HEIGHT), col = "tomato")
legend("bottomleft", "Average Height", lty = 1, col = "tomato", bty = "n")
```
Plot
Conditional Mean
Just as we adjusted our guess about the student’s SAT once they told us their GPA, we would adjust our guess about a player’s height if I told you I would only randomly sample from WNBA players. Let’s rerun the simulation, except only sampling from the WNBA distribution.
Code
```r
# Same simulation, but only sampling from WNBA players
sim <- function(my_pick, n){
  sqr_diff <- rep(NA, n)
  for(i in 1:n){
    random_player <- sample(which(bball$LEAGUE == "WNBA"), 1)
    sqr_diff[i] <- (my_pick - bball$HEIGHT[random_player])^2
  }
  mean(sqr_diff)
}

the_picks <- seq(mean(bball$HEIGHT) - 6, mean(bball$HEIGHT) + 6, by = 0.1)
avg_sqr_diffs <- NULL
for(pick in the_picks){
  avg_sqr_diffs[length(avg_sqr_diffs) + 1] <- sim(pick, 100)
}

plot(the_picks, avg_sqr_diffs, las = 1,
     xlab = "Guess for Height", pch = 19,
     ylab = "Average Squared Error",
     col = alpha("dodgerblue", 0.33))
abline(v = mean(bball$HEIGHT), col = "tomato")
abline(v = mean(bball$HEIGHT[bball$LEAGUE == "WNBA"]), col = "mediumseagreen", lty = 2)
legend("topleft", c("Average Height", "Average WNBA Height"),
       lty = 1:2, col = c("tomato", "mediumseagreen"), bty = "n")
```
Plot
Conditional Mean
In both cases, you can see that the points reach a (near) minimum error when the “guess” is near the mean of the respective distribution.
Keep in mind that our goal here was to minimize the sum (or mean) of squared errors. We will revisit this soon, but first, we are going to jump back to GPAs and SAT scores.
Conditional Mean
Our goal as guidance counselors is to give students our best guess for what they’ll score on their SAT. In essence, we want a function that converts a GPA into an SAT score. We want this function to have a few properties:
SAT score must be dependent on GPA.
SAT score should increase as GPA increases (since they’re positively correlated).
Given an average GPA, the model should output an average SAT.
The model should be correct on average.
Like we did with the law of demand, we are going to guess a functional form for the relationship between GPA and SAT:

\[\text{SAT}_i = \beta_0 + \beta_1\text{GPA}_i\]
This functional form satisfies our first two conditions right off the bat. However, we still need to select values for \(\beta_0\) and \(\beta_1\) to satisfy the third and fourth conditions.
Conditional Mean
This equation of a line gives us the conditional expectation of SAT given GPA. This is most often written as \(E[ \ \text{SAT} \ | \ \text{GPA} \ ]\). You might say, “that’s great, but what values should we use for \(\beta_0\) and \(\beta_1\)?” Let’s pick a few different values and plot them along with the data.
Code
```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")

plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(a = 10, b = 30, lty = 1, col = "tomato", lwd = 2)
abline(a = 40, b = 20, lty = 2, col = "orchid", lwd = 2)
abline(a = 70, b = 10, lty = 3, col = "dodgerblue", lwd = 2)
legend("bottomright", c("B0 = 10; B1 = 30",
                        "B0 = 40; B1 = 20",
                        "B0 = 70; B1 = 10"),
       col = c("tomato", "orchid", "dodgerblue"),
       lty = 1:3, bty = "n", lwd = 2)
```
Plot
No matter what values we choose for \(\beta_0\) and \(\beta_1\), there’s never going to be a single line that runs through every one of the points. To fix this, let’s write the equation as follows:
\[\text{SAT}_i = \beta_0 + \beta_1\text{GPA}_i + \epsilon_i\] where \(\epsilon_i\) (epsilon) is our error term. This will make up the difference between the point and the line we draw through the points.
Line of Best Fit
Ideally, we wouldn’t need \(\epsilon\) at all and our model would perfectly fit the data. However, since we do need \(\epsilon\), we want to minimize it. Specifically, like we did when guessing the heights of basketball players, we are going to minimize the sum of squared errors, or \(\sum \epsilon_i^2\). Note that we can calculate the errors associated with choosing \(\beta_0\) and \(\beta_1\) as follows: \(\epsilon_i = \text{SAT}_i - \beta_0 - \beta_1 \text{GPA}_i\).
Because the chosen \(\beta_0\) and \(\beta_1\) minimize these errors, the resulting line is colloquially known as the “line of best fit”.
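To make this concrete, here is a minimal sketch that scores the three candidate lines from the earlier plot by their sum of squared errors; the helper name `sse` is my own, not from the original code:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")

# Sum of squared errors for a candidate line SAT = b0 + b1 * GPA
sse <- function(b0, b1, gpa, sat){
  sum((sat - b0 - b1 * gpa)^2)
}

# Score the three guessed lines from the earlier plot; smaller is better
sse(10, 30, df$hs_gpa, df$sat_sum)
sse(40, 20, df$hs_gpa, df$sat_sum)
sse(70, 10, df$hs_gpa, df$sat_sum)
```

Whichever pair of values produces the smallest sum of squared errors is, by this criterion, the better line.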
Ordinary Least Squares
The algorithm we are going to use to solve for \(\beta_0\) and \(\beta_1\) is called Ordinary Least Squares, or OLS. To begin, let’s start with writing an expression for the sum of squared errors.
\[S = \sum \epsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2 \] To minimize an expression, we take its derivative and set it equal to zero. Since \(\beta_0\) and \(\beta_1\) are the unknowns, these are what we need to take the derivative with respect to. First, we will focus on \(\beta_0\).
Ordinary Least Squares
Taking the derivative of \(S\) with respect to each parameter and setting it equal to zero yields:

\[\frac{\partial S}{\partial \beta_0} = -2\sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \quad \Rightarrow \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

\[\frac{\partial S}{\partial \beta_1} = -2\sum X_i(Y_i - \beta_0 - \beta_1 X_i) = 0 \quad \Rightarrow \quad \hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\]

Applying these formulas to our data gives \(E[ \ \text{SAT} \ | \ \text{GPA} \ ] = 66.99 + 11.36 \times \text{GPA}\). This expression gives us the expectation of a student’s SAT score given their GPA. We know for sure that this minimizes the sum of squared errors (that is what the math above guarantees), but is the model correct on average?
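We can check the “correct on average” property directly: whenever a regression includes an intercept, the residuals average to zero and the fitted line passes through the point of means. A quick sketch using R’s built-in `lm()` rather than the hand-derived formulas:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)

# The residuals average to (numerically) zero...
mean(resid(fit))

# ...and the line passes through the point of means:
# beta0 + beta1 * mean(GPA) equals mean(SAT)
coef(fit)[[1]] + coef(fit)[[2]] * mean(df$hs_gpa)
mean(df$sat_sum)
```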
What does the model look like relative to the data?
Code
```r
# beta0 and beta1 are the OLS estimates; if they are not already
# in memory, they can be recovered with lm():
fit <- lm(sat_sum ~ hs_gpa, data = df)
beta0 <- coef(fit)[1]
beta1 <- coef(fit)[2]

plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(a = 10, b = 30, lty = 1, col = "tomato", lwd = 2)
abline(a = 40, b = 20, lty = 2, col = "orchid", lwd = 2)
abline(a = 70, b = 10, lty = 3, col = "dodgerblue", lwd = 2)
abline(a = beta0, b = beta1, lty = 1, col = "gold", lwd = 4)
legend("bottomright", c("B0 = 10; B1 = 30",
                        "B0 = 40; B1 = 20",
                        "B0 = 70; B1 = 10",
                        "OLS"), ncol = 2,
       col = c("tomato", "orchid", "dodgerblue", "gold"),
       lty = c(1:3, 1), bty = "n", lwd = c(2, 2, 2, 4))
```
Plot
In fact, our initial guess of SAT = 70 + 10*GPA was pretty good, but OLS found a better line.
Ordinary Least Squares
In terms of interpretation, what does this equation mean in words?
\(\beta_1\) (11.36): If we increase GPA by one unit (e.g., from 2.2 to 3.2), we expect the sum of our SAT percentiles to increase by 11.36. In more abstract terms, “a one unit increase in \(X\) will change \(Y\) by \(\beta_1\).”
\(\beta_0\) (66.99): If someone has a GPA of 0, we expect their sum of SAT percentiles to be equal to 66.99. In more abstract terms, “the value of \(Y\) when \(X=0\).”
As long as “a one unit change” makes sense for \(X\), \(\beta_1\) will be interpretable. For \(\beta_0\) to be interpretable, \(X\) must be able to take on a reasonable value of zero. In this case, since a GPA of zero is nearly impossible, \(\beta_0\) does not have much of an interpretation.
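To make the interpretation of \(\beta_1\) concrete, here is a tiny sketch that plugs GPAs into the estimated equation; `predict_sat` is a hypothetical helper name of my own:

```r
# Fitted OLS line from the text: SAT = 66.99 + 11.36 * GPA
predict_sat <- function(gpa) 66.99 + 11.36 * gpa

predict_sat(2.8)  # 98.798 -- guess for the student with a 2.8 GPA
predict_sat(3.8)  # 110.158 -- a one unit increase in GPA adds exactly 11.36
```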
Similarly, I built this little app where you can compare whatever regression line you want against the one via OLS. I guarantee you will never outperform OLS.
Statistical Inference
Just like how we tested whether our sample mean was statistically different from some hypothesized value in the last module, we can test whether our regression coefficients are different from hypothesized values as well.
Focusing on \(\beta_1\), and skipping the math, we can write the standard error of \(\beta_1\) as follows:

\[se_{\beta_1} = \sqrt{\frac{\sum \hat{\epsilon}_i^2 \ / \ (n-2)}{\sum (X_i - \bar{X})^2}}\]
To test if your estimate of \(\beta_i\) is different from another number, we will first generate a test statistic:
\[t = \frac{\beta_i - \#}{se_{\beta_i}}\]
Generally, people choose 0 for \(\#\), since a coefficient of 0 would imply that the regressor (e.g., GPA) has no relationship with the outcome variable (e.g., SAT score). This way, the hypothesis test is set up such that \(\beta_i\) is either related to \(Y\) or it is not.
Once we calculate this test statistic, we can calculate a p-value and decide whether to reject the null hypothesis that our regressor is not related to the outcome variable.
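To see the mechanics, here is a sketch that builds the t-statistic and p-value by hand from `lm()` output and compares them to what `summary()` reports; the column names follow R’s standard regression table:

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)
cf <- summary(fit)$coefficients

beta1 <- cf["hs_gpa", "Estimate"]
se_beta1 <- cf["hs_gpa", "Std. Error"]

# Test statistic for the null hypothesis that beta1 = 0
t_stat <- (beta1 - 0) / se_beta1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = nrow(df) - 2)

# These match the "t value" and "Pr(>|t|)" columns of summary(fit)
c(t_stat, p_val)
```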
Statistical Inference
As a final note about inference (i.e., hypothesis tests), we need to make two important assumptions about our errors (\(\hat{\epsilon}_i\)) in order for the above t-statistic to be valid.
Normality: we assume that our errors are independent, random draws from a normal distribution.
Homoskedasticity: this big word means that the normal distribution we draw our errors from has a constant variance across all points. See below for examples of homoskedastic errors and heteroskedastic errors.
Plot
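To build intuition for the difference, here is a minimal sketch of how errors like those pictured could be simulated; the seed, sample size, and standard deviations are arbitrary choices of mine, not necessarily how the original figure was made:

```r
set.seed(311)  # arbitrary seed (an assumption of mine)
x <- runif(200, 0, 10)

# Homoskedastic: error variance is constant across x
e_homo <- rnorm(200, sd = 2)

# Heteroskedastic: error variance grows with x
e_hetero <- rnorm(200, sd = 0.5 * x)

par(mfrow = c(1, 2))
plot(x, e_homo, pch = 19, las = 1, main = "Homoskedastic", ylab = "Error")
plot(x, e_hetero, pch = 19, las = 1, main = "Heteroskedastic", ylab = "Error")
```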
Goodness of Fit
Once we estimate \(\beta_0\) and \(\beta_1\), we can also begin to talk about how well the model “fits” the data. In other words, what fraction of \(Y\)’s variance is explained by \(X\)?
Before getting too much further with this, I want to be very clear that goodness-of-fit measures do not determine if your model is good. You may be able to explain a lot of \(Y\)’s variation, and still have a bad model. In addition, your model might have poor overall explanatory power, but still be a perfect explainer of what you’re interested in. We will discuss this more as the course progresses.
Goodness of Fit
Below are the three equations for the Total Sum of Squares (SST), the Explained Sum of Squares (SSE), and the Residual Sum of Squares (SSR):

\[SST = \sum (Y_i - \bar{Y})^2 \qquad SSE = \sum (\hat{Y}_i - \bar{Y})^2 \qquad SSR = \sum \hat{\epsilon}_i^2\]
Once we have defined these three measures, we can connect all three by the following equation:
\[SST = SSE + SSR\]
If our model is very good at explaining the variation in \(Y\), we would expect SSE to be very large relative to SSR. For a perfect model, we would see \(SST = SSE\). Therefore, we can create a ratio between these two numbers, which gives us the percentage of the variation that is explained by the model. We call this ratio \(R^2\), and define it as follows:

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]
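These sums are easy to compute directly; a sketch that builds SST, SSE, and SSR from `lm()` output and verifies both the decomposition and \(R^2\):

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
fit <- lm(sat_sum ~ hs_gpa, data = df)

y <- df$sat_sum
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total variation in Y
SSE <- sum((y_hat - mean(y))^2)  # variation explained by the model
SSR <- sum((y - y_hat)^2)        # leftover (residual) variation

# SST = SSE + SSR, and R^2 is the explained share
SSE / SST
summary(fit)$r.squared  # same number
```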